# Low Delay-Variation Sub-/Near-Threshold Asynchronous-to-Synchronous Interface Controller for GALS Network-on-Chips

Weng-Geng Ho\*, Kwen-Siong Chong, Bah-Hwee Gwee and Joseph S. Chang Nanyang Technological University 50 Nanyang Avenue, Singapore 639798 \*howe0031@ntu.edu.sg

Abstract: We propose an Asynchronous-to-Synchronous Interface Controller (A2S-IC) with low delay-variation under Process, Voltage and Temperature (PVT) variations for subthreshold/near-threshold operation in low-power applications. This A2S-IC is targeted at a full-range Dynamic Voltage Scaling (DVS) Global-Asynchronous-Local-Synchronous (GALS) Network-on-Chip (NoC). The proposed A2S-IC has three key attributes. First, it is realized using static logic (rather than dynamic logic) and is hence more appropriate for DVS (and subthreshold operation). Second, it is implemented using gate-level standard cells to simplify the implementation effort. Third, it is designed to share some internal nodes, hence reducing the redundant switching for data validity checking. The proposed A2S-IC is compared against its reported dynamic-logic counterpart; both are implemented in the same 65nm CMOS process. Based on simulations conducted at 27°C, our proposed A2S-IC is more throughput-efficient at near- and subthreshold operation, featuring ~19% and ~66% faster throughput at  $V_{\rm DD}$  =0.5V and  $V_{\rm DD}$  =0.3V respectively. When the temperature variation (0°C to 100°C) is considered at subthreshold operation, the proposed A2S-IC demonstrates 140% faster throughput than the reported design; the former features only up to 1.6× delay-variation, whereas the latter exhibits up to 4× delay-variation. The proposed A2S-IC is able to operate at a supply voltage as low as 0.15V (as opposed to 0.3V for the reported design).

# I. INTRODUCTION

Designing multi-core processors realized by Network-on-Chip (NoC) [1]-[2] is a promising research area due to their improved speed (owing to parallel processing) and potentially high energy-efficiency (contributed by partial on-demand computation). One of the main obstacles to the progression and popularity of NoCs is the complexity of data synchronization. The contemporary synchronous-logic (sync) methodology, which uses a global clock to synchronize digital operations among different cores, becomes increasingly challenging due to the timing margin that the clock infrastructure requires for correct operation. The problem is further complicated when Process, Voltage and Temperature (PVT) variations are considered. Instead, an asynchronous-logic (async) methodology could be adopted in part to circumvent this data synchronization issue. The basic premise is that async circuits are essentially self-timed circuits that use async handshake protocols [3], [4] to synchronize digital operations.

Due to the lack of CAD tools and testing protocols, fully async multi-core networks tend to require excessive design effort. As a result, the Global-Asynchronous-Local-Synchronous (GALS) scheme [5] (with inter-core async handshake operation and intra-core sync clock operation) is often adopted for such multi-core networks. For this purpose, an Async-to-Sync Interface Controller (A2S-IC) [6] is required to convert the data from a router (async pipeline) to a processor core (sync pipeline), as illustrated in Fig. 1.

Several A2S-ICs have been reported to date for various applications. Some A2S-ICs [7]-[9], which interface the sync pipeline with an async pipeline using the matched-delay single-rail protocol, are not considered in our study due to their low robustness towards PVT variations. In contrast, the A2S-ICs [6],[10] that interface the sync pipeline with an async pipeline using the Quasi-Delay-Insensitive (QDI) dual-rail protocol are considered in our design. In particular, the reported dynamic-logic A2S-IC [6] is advantageous due to its simpler structure and lower transistor count, and hence its relatively higher speed and lower power operation. It consists of a Sync Data Generator, a Sync Validity Generator and a Completion Detection, which collectively convert the async dual-rail data to sync single-rail data and acknowledge the async pipeline for the evaluation/reset phases.

Nevertheless, the reported dynamic-logic A2S-IC poses several design challenges when it is targeted at a GALS NoC with full-range Dynamic-Voltage-Scaling (DVS), from the nominal through the near-threshold down to the sub-threshold voltage region. In particular, in near- to sub-threshold operation, the on-current ( $I_{\rm on}$ ) of the transistor, and consequently the associated circuit delay, is exponentially dependent on PVT variations [4]. Not unexpectedly, the delay-variation (and the worst-case delay) of the A2S-IC therefore increases significantly, unnecessarily slowing down the entire system.



Fig. 1: An NoC Example

In this paper, we propose a low delay-variation A2S-IC (as an alternative to the dynamic-logic A2S-IC) for DVS GALS NoCs. The proposed A2S-IC adopts static logic for low delay-variation in the near- to sub-threshold voltage regions. In addition, the proposed A2S-IC is implemented with commercial standard cells to simplify the design effort. Furthermore, the proposed A2S-IC avoids redundant data validity checking by integrating the otherwise separate Sync Validity Generator and Completion Detection. Based on simulations conducted in a 65nm CMOS process (at 27°C), the proposed design features ~19% and ~66% faster throughput than the reported design at  $V_{\rm DD}$  =0.5V and  $V_{\rm DD}$  =0.3V respectively. Over the temperature variation (0°C to 100°C) at subthreshold operation, the proposed A2S-IC features 140% faster throughput than the reported design. In addition, the proposed design exhibits only up to 1.6× delay-variation, compared to up to 4× delay-variation in the reported design. Finally, the proposed A2S-IC can operate at a 0.15V supply voltage, compared to 0.3V for the reported design.

This paper is organized as follows. Section II describes our proposed A2S-IC based on the static-logic standard-cell approach, and simulation results are presented in Section III. Finally, conclusions are drawn in Section IV.

# II. PROPOSED A2S INTERFACE

## A. Interface Structure

The interface between the async QDI pipeline and the sync clocked pipeline, for one-bit data transfer between a four-phase dual-rail system and a standard single-rail sync system, is illustrated in Fig. 2. The data signal from the async QDI pipeline, represented by the true rail A.T and the false rail A.F, serves as the input to the A2S-IC. The acknowledgement handshake signal Lack is generated by the A2S-IC and sent to the async QDI pipeline once the data has been processed. At the sync module (which includes sync latches), SD represents the sync data signal and SV represents the sync validity signal, accompanied by the single-phase clock signal CLK. SVC is the SV value latched in the succeeding sync pipeline stage after a clock cycle; it is returned to the A2S-IC for acknowledgement purposes. A request signal (Req) is sent to the Stoppable Clock to control CLK, and a grant signal (Grant) is returned to the A2S-IC.

The A2S-IC input signals A.T and A.F and the output signals SD and SV are tabulated in Table I. When the data is empty (both A.T and A.F = '0'), SD remains the same as in the previous state and SV is '0', showing that the sync data is empty. When A.T is '1' (A.F is '0'), the dual-rail data is valid and carries the bit value '1', thus asserting both SV and SD to '1'. In contrast, when A.F is '1' (A.T is '0'), the validity of the data asserts SV to '1' but the bit value '0' negates SD to '0'. Note that under the async dual-rail rule, the simultaneous presence of '1' on both A.T and A.F is not possible.

TABLE I. INPUTS AND CORRESPONDING OUTPUTS FOR A2S-IC

| Async QDI Interface |     | Sync Clock Interface |           |
|---------------------|-----|----------------------|-----------|
| A.T                 | A.F | SV                   | SD        |
| 0                   | 0   | 0                    | No Change |
| 0                   | 1   | 1                    | 0         |
| 1                   | 0   | 1                    | 1         |
| 1                   | 1   | Not Valid            | Not Valid |
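The mapping in Table I can be sketched as a small behavioural model. This is only an illustrative reading of the table, not the gate-level circuit of Fig. 3: the function name `a2s_outputs` and the `prev_sd` argument are hypothetical, and the role of SVC and CLK is deliberately left out.

```python
from typing import Tuple


def a2s_outputs(a_t: int, a_f: int, prev_sd: int) -> Tuple[int, int]:
    """Return (SV, SD) for one dual-rail input pair, following Table I.

    prev_sd is the SD value held from the previous state; it is returned
    unchanged when the dual-rail input is empty.
    """
    if a_t not in (0, 1) or a_f not in (0, 1):
        raise ValueError("A.T and A.F must each be 0 or 1")
    if a_t == 1 and a_f == 1:
        # Forbidden by the dual-rail rule ("Not Valid" row of Table I).
        raise ValueError("A.T = A.F = 1 is not a valid dual-rail codeword")
    if a_t == 0 and a_f == 0:
        return 0, prev_sd  # empty: SV = 0, SD holds its previous value
    return 1, a_t  # valid: SV = 1, SD carries the transmitted bit


if __name__ == "__main__":
    sd = 0
    # Alternate valid codewords with the empty (spacer) codeword.
    for a_t, a_f in [(0, 0), (1, 0), (0, 0), (0, 1), (0, 0)]:
        sv, sd = a2s_outputs(a_t, a_f, sd)
        print(f"A.T={a_t} A.F={a_f} -> SV={sv} SD={sd}")
```

Raising an error for the A.T = A.F = '1' case is a modelling choice made here; the table only marks that row as not valid.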

## B. Proposed A2S-IC

We propose a low delay-variation A2S-IC for Global-Asynchronous-Local-Synchronous (GALS) Network-on-Chip (NoC) that enables dynamic voltage scaling for low-power/high-speed applications. Fig. 3(a) depicts the Sync Data Generator of the proposed A2S-IC, which accepts the dual-rail inputs A.T and A.F to generate SD. Fig. 3(b) depicts the Sync Validity Generator integrated with the Completion Detection, unlike the separate individual blocks in the reported design. The integrated block consistently asserts/negates SV based on the validities of A.T/A.F and  $SD/\overline{SD}$  and on the feedback SVC from the sync module, accompanied by CLK. SV is inverted to generate Lack.



Fig. 3: Proposed A2S-IC: (a) Sync Data Generator, and (b) Sync Validity Generator integrated with Completion Detection



Fig. 2: A2S Interface Structure

The operation of the proposed A2S-IC is as follows. Initially, A.T/A.F is empty, SD, SV and SVC are '0', and Lack is '1'. When valid data arrives on A.T/A.F, SD is respectively asserted or left unchanged. SV is asserted to '1' (valid), thus negating Lack to '0', showing that the A2S data conversion is completed and the interface is ready for reset. SV is latched into the next sync pipeline stage after one CLK cycle and SVC is asserted to '1' ( $\overline{SVC}$  = '0'). SV is now negated to '0' ( $\overline{SV}$  = '1'). Once A.T/A.F is empty, Lack is asserted to '1', showing that the reset is completed and the interface is ready for the next cycle of A2S data conversion.

For comparison, Fig. 4(a) depicts the Sync Data Generator of the reported A2S-IC, which depends on the dual-rail inputs A.T and A.F to generate SD. Fig. 4(b) depicts the Sync Validity Generator, which consists of domino/dynamic logic that consistently asserts/negates SV based on the validities of A.T/A.F and  $SD/\overline{SD}$  and on the feedback SVC from the sync side, accompanied by CLK. Fig. 4(c) depicts the Completion Detection, which accepts the validities of A.T/A.F and  $\overline{SV}$  to generate Lack for async acknowledgement.



Fig. 4: Reported A2S-IC Parts: (a) Sync Data Generator, (b) Sync Validity Generator, and (c) Completion Detection

In the proposed A2S-IC, the Sync Validity Generator can be integrated with the Completion Detection since SV can detect both the validity and the neutrality of A.T/A.F and  $SD/\overline{SD}$  (via a combination of AND and OR gates). This is due to the evaluate (for validity detection) and reset (for neutrality detection) abilities of the standard cells. In contrast, the reported Sync Validity Generator detects only the validity of A.T/A.F and  $SD/\overline{SD}$  in the series NMOS pull-down network (see Fig. 4(b)). The reported Completion Detection therefore requires a NOR gate and a Muller C-element, together with the generated  $\overline{SV}$ , in order to detect the neutrality of the above-mentioned signals (see Fig. 4(c)). In other words, the reported design evaluates the data validity twice, in both the Sync Validity Generator and the Completion Detection, whereas the proposed design avoids this redundant data validity checking.

Moreover, the reported A2S-IC adopts domino/dynamic logic for the data-convert operation, where the output states (SD, SV, Lack) are maintained by staticizers (weak feedback inverters) that ensure proper operation and serve as implicit latches. At the nominal condition, the reported design, which adopts dynamic logic, potentially dissipates lower leakage power (due to its lower transistor count) than its static-logic counterparts. In terms of operating speed, dynamic logic is often faster due to its low load capacitance.

## C. Advantageous Features

Table II summarizes the general features of the reported and the proposed A2S-ICs. For a fair comparison, standard transistor sizing for data-convert (with speed optimization) is adopted for both the reported and proposed designs. When the operating voltage is scaled downwards, a design based on dynamic logic can be more delay-sensitive due to the reduced  $I_{on}/I_{off}$  ratio. This shows up as a relatively larger normalized delay (than that of static logic) in the near-threshold and subthreshold voltage regions, resulting in a longer worst-case critical path delay and hence unnecessarily slowing down the entire system. In contrast, the proposed design adopts a static-logic implementation, which has lower delay-variation even from the near-threshold to the sub-threshold voltage region. Lower delay-variation implies that, compared to the reported counterpart, the proposed design does not incur additional unnecessary delay when the operating voltage is scaled downwards. Since the speed of the entire GALS NoC depends on the worst-case delay of the individual blocks (due to the gate-level fine-grained pipeline realization), lower delay-variation in the A2S-IC (for data convert) allows a higher operating speed in the overall system.

TABLE II. COMPARISON OF GENERAL FEATURES OF THE PROPOSED AND REPORTED A2S-ICS

| A2S-ICs                               |                     | Proposed        | Reported [6]     |
|---------------------------------------|---------------------|-----------------|------------------|
| CMOS Process                          |                     | 65nm            | 65nm             |
| Logic Family                          |                     | Static-Logic    | Dynamic-Logic    |
| Cell Library                          |                     | Standard-Cell   | Full Custom-Cell |
| Implementation Level                  |                     | Gate-Level      | Transistor-Level |
| Transistor Sizing                     |                     | Standard-Sizing | Standard-Sizing  |
| Nominal $V_{\rm DD}$                  | Delay-Variation     | Low             | Low              |
|                                       | Speed               | Moderate        | Fast             |
| Near-Threshold $V_{\rm DD}$           | Delay-Variation     | Low             | Moderate         |
|                                       | Speed               | Moderate        | Moderate         |
| Sub-Threshold $V_{\rm DD}$            | Delay-Variation     | Low             | High             |
|                                       | Speed               | Moderate        | Slow             |

# III. SIMULATION RESULTS

The reported and proposed A2S-ICs are simulated using Cadence Virtuoso and Synopsys NanoSim in a 65nm CMOS process. We define the normalized delay (with respect to the nominal-voltage, room-temperature condition) as the ratio of the longer delay required at a given operating condition to the delay at the nominal condition ( $V_{\rm DD}=1\rm V$ , Temperature = 27°C). A higher normalized delay indicates a higher delay-variation from the nominal condition under PVT variations, and vice versa.
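Written as a formula, this definition amounts to the following. The symbols $D$ (the delay at a given supply voltage and temperature) and $D_{\rm norm}$ are notation introduced here for illustration only and do not appear in the original text:

$$
D_{\rm norm}(V_{\rm DD}, T) = \frac{D(V_{\rm DD}, T)}{D(1\,{\rm V},\ 27^{\circ}{\rm C})}
$$

Under this reading, $D_{\rm norm} = 1$ at the nominal condition, and larger values indicate a larger slow-down relative to it.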

The results of the reported and proposed designs at the near-threshold (0.5V) and sub-threshold (0.3V) voltages are tabulated in Table III. The throughput refers to the maximum operable frequency of the design. From the table, at  $V_{\rm DD}$  =0.5V and  $V_{\rm DD}$  =0.3V, the proposed design is ~19% and ~66% faster respectively, demonstrating its low delay-variation with respect to operating-voltage variation.

TABLE III. THROUGHPUT OF THE REPORTED AND PROPOSED A2S-ICS

|              | Throughput (GHz) at $V_{\rm DD}$ = 0.5V | Throughput (GHz) at $V_{\rm DD}$ = 0.3V |
|--------------|------------------|-------|
| Reported [6] | 0.53             | 0.05  |
| Proposed     | 0.63             | 0.083 |
| Improvement  | 19%              | 66%   |
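The improvement row can be reproduced from the two throughput rows with a short calculation. The sketch below assumes that "improvement" means the relative throughput gain of the proposed design over the reported one; the helper name `improvement_percent` is hypothetical.

```python
def improvement_percent(proposed_ghz: float, reported_ghz: float) -> float:
    """Relative throughput gain of the proposed design, in percent."""
    return (proposed_ghz / reported_ghz - 1.0) * 100.0


# Throughput values (GHz) copied from Table III.
TABLE_III = {
    "0.5V": {"reported": 0.53, "proposed": 0.63},
    "0.3V": {"reported": 0.05, "proposed": 0.083},
}

if __name__ == "__main__":
    for vdd, row in TABLE_III.items():
        gain = improvement_percent(row["proposed"], row["reported"])
        print(f"VDD = {vdd}: {gain:.1f}% faster")
```

With the tabulated values, the gains round to 19% and 66%, in agreement with the table.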

Fig. 5 depicts the normalized delay (to the nominal condition) of the reported and proposed A2S-ICs for various operating voltages at room temperature. We make the following observations. First, the normalized delay of both designs increases as  $V_{\rm DD}$  is scaled downwards. This is mainly because a larger clock period is required for the slower data conversion to operate correctly. Second, the proposed design has a lower normalized delay over the voltage range, since the low delay-variation of static logic makes the circuit more tolerant to voltage scaling. Third, from  $V_{\rm DD}$  =1V to 0.3V, the proposed A2S-IC features only up to ~28.6× delay-variation, as opposed to up to ~58.8× for the reported design. Fourth, the proposed design can operate at a  $V_{\rm DD}$  as low as 0.15V (compared to 0.3V for the reported counterpart).



Fig. 5: Normalized Delay (to nominal condition) for Various  $V_{\rm DD}$  at Room Temperature



Fig. 6: Normalized Delay (to nominal condition) for Various Temperature at  $V_{\rm DD} = 0.3 \rm V$ 

Fig. 6 depicts the normalized delay (to the nominal condition) of both A2S-ICs for various temperatures at  $V_{\rm DD} = 0.3 \, \rm V$ . We make the following observations. First, the normalized delay decreases as the temperature increases, due to sub-threshold operation effects. Second, the proposed design features a lower normalized delay over the entire temperature range. Third, the change in normalized delay for the reported design is much larger, ranging from  $\sim 30 \times$  at  $100^{\circ}$C to  $\sim 120 \times$  at  $0^{\circ}$C, which is up to  $4 \times$  delay-variation. In contrast, the proposed design has only up to  $1.6 \times$  delay-variation (ranging from  $\sim 25 \times$  at  $100^{\circ}$C to  $\sim 40 \times$  at  $0^{\circ}$C).
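The delay-variation figures over temperature follow from the ratio of the largest to the smallest normalized delay in each sweep. The sketch below reproduces them and adds a rough consistency check on the quoted throughput gain. That check rests on an assumption made here, not stated in the text: that delay is the reciprocal of throughput, so the nominal delay of each design can be inferred from Table III and the 0.3V normalized delays reported for Fig. 5. All function and variable names are hypothetical, and the inputs are the approximate values quoted in the text.

```python
def delay_variation(norm_delays):
    """Ratio of the largest to the smallest normalized delay in a sweep."""
    values = list(norm_delays)
    return max(values) / min(values)


# Approximate normalized delays at VDD = 0.3V, as quoted for Fig. 6.
REPORTED = {"100C": 30.0, "0C": 120.0}
PROPOSED = {"100C": 25.0, "0C": 40.0}

# Normalized delays at VDD = 0.3V and 27C (quoted for Fig. 5) and the
# corresponding throughputs in GHz (Table III).
NORM_27C = {"reported": 58.8, "proposed": 28.6}
GHZ_27C = {"reported": 0.05, "proposed": 0.083}


def nominal_delay_ns(design):
    """Implied nominal delay in ns, ASSUMING delay = 1 / throughput."""
    return (1.0 / GHZ_27C[design]) / NORM_27C[design]


def speedup_at_0c_percent():
    """Implied throughput gain at 0C under the same assumption."""
    reported_ns = REPORTED["0C"] * nominal_delay_ns("reported")
    proposed_ns = PROPOSED["0C"] * nominal_delay_ns("proposed")
    return (reported_ns / proposed_ns - 1.0) * 100.0


if __name__ == "__main__":
    print("reported delay-variation:", delay_variation(REPORTED.values()))
    print("proposed delay-variation:", delay_variation(PROPOSED.values()))
    print("implied gain at 0C (%):", round(speedup_at_0c_percent(), 1))
```

The two ratios come out at 4 and 1.6, matching the figures above. Under the stated assumption the implied gain at 0°C is a little over 140%, which is in line with the quoted 140% faster throughput, although the inputs are only approximate readings.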

# IV. CONCLUSIONS

A low delay-variation A2S-IC has been proposed for GALS NoCs and benchmarked against the reported A2S-IC. The proposed A2S-IC is realized using static logic, which has low delay-variation in the sub-threshold voltage region, and is designed using gate-level standard library cells for ease of implementation. In addition, the proposed A2S-IC integrates the Sync Validity Generator with the Completion Detection to avoid the redundant switching for data validity checking. Based on simulations in a 65nm CMOS process at 27°C, our proposed A2S-IC features  $\sim$ 19% and  $\sim$ 66% faster throughput than the reported design at  $V_{\rm DD}$  =0.5V (near-threshold voltage) and  $V_{\rm DD}$  =0.3V (sub-threshold voltage) respectively. Over the temperature variation of 0°C to 100°C at sub-threshold operation, the proposed A2S-IC features 140% faster throughput with only up to 1.6× delay-variation (compared to up to 4× delay-variation in the reported design). The proposed A2S-IC has been demonstrated to be functional at a 0.15V supply voltage (as opposed to 0.3V for the reported design).

# ACKNOWLEDGEMENT

This research work was supported by the Agency for Science, Technology and Research (A\*STAR), Singapore, under SERC 2013 Public Sector Research Funding, Grant No: SERC1321202098. The authors thank A\*STAR for the kind support in funding this research.

# REFERENCES

- [1] D. Rostislav, V. Vishnyakov, E. Friedman and R. Ginosar, "An asynchronous router for multiple service levels networks on chip," *IEEE ASYNC*, pp. 44-53, Mar. 2005.
- [2] T. Felicijan and S. B. Furber, "An asynchronous on-chip network router with quality-of-service (QoS) support," *IEEE Int. SoC Conf.*, pp. 274-277, Sep. 2004.
- [3] K.-S. Chong, B.-H. Gwee and J. Chang, "Energy-efficient synchronous-logic and asynchronous-logic FFT/IFFT processors," *IEEE JSSC*, vol. 42, no. 9, pp. 2034-2045, Sep. 2007.
- [4] T. Lin, K.-S. Chong, J. S. Chang and B.-H. Gwee, "An ultra-low power asynchronous-logic in-situ self-adaptive VDD system for wireless sensor network," *IEEE JSSC*, vol. 48, no. 2, pp. 573-586, Feb. 2013.
- [5] K.-S. Chong, K.-L. Chang, B.-H. Gwee and J. Chang, "Synchronous-logic and Globally-Asynchronous-Locally-Synchronous (GALS) acoustic digital signal processors," *IEEE JSSC*, vol. 47, no. 3, pp. 769-780, Mar. 2012.
- [6] A. J. Martin and M. Nystrom, "Asynchronous techniques for system-on-chip designs," *Proc. IEEE*, vol. 94, no. 6, pp. 1089-1120, Jun. 2006.
- [7] A. E. Sjogren and C. J. Myers, "Interfacing synchronous and asynchronous modules within a high-speed pipeline," *IEEE TVLSI*, vol. 8, no. 5, pp. 573-583, Oct. 2000.
- [8] T. Chelcea and S. M. Nowick, "Robust interfaces for mixed-timing systems," *IEEE TVLSI*, vol. 12, no. 8, pp. 857-873, Aug. 2004.
- [9] R. Dobkin, R. Ginosar and C. P. Sotiriou, "High rate data synchronization in GALS SoCs," *IEEE TVLSI*, vol. 14, no. 10, pp. 1063-1074, Oct. 2006.
- [10] E. Beigne and P. Vivet, "Design of on-chip and off-chip interfaces for a GALS NoC architecture," *IEEE ASYNC*, pp. 172, Mar. 2006.